1 Preparatory steps for the Markdown file

Setting the options for knitr

library(knitr)
knitr::opts_chunk$set(results = 'asis',      # Can also be set at the chunk-level
                       echo = TRUE,
                       comment = NA,
                       prompt  = FALSE,
                       warning = FALSE,
                       message = FALSE,
                       fig.align = "center",
                       fig.width = 8.125,
                       out.width = "100%",
                       fig.path = "D:/Figures/figures_sumap/",
                       dev = c('png', 'tiff'),
                       dev.args = list(png = list(type = "cairo"), tiff = list(compression = 'lzw')),
                       cache = TRUE,
                       cache.path = "D:/cache/cache_sumap/",
                       autodep = TRUE)

options(width = 1000, scipen = 999, knitr.kable.NA = '')

Loading libraries…

library(dplyr)
library(tidyverse) # The tidy verse
library(gridExtra)
library(uwot)
#require(devtools)
#install_version("clues", version = "0.6.2.2", repos = "http://cran.us.r-project.org")
library(clues)
library(parallel)
library(plotly)
library(splitstackshape)

Cleaning the memory…

rm(list = ls(all.names = TRUE))

2 Loading and preparing the data

Loading previously defined objects (dataset, sets of predictors…

load(file = "Study_objects.RData")

3 Function to compute and display Supervised UMAP & Silhouette Scores

We define the function get_umap_and_silhouette() which takes as input a data frame (or tibble) which contains each raw initial features from the three sets (Bioacoustic, DCT, MFCC) and target labels (individual or type). The target parameter target_DV that specifies the supervised target information passes this target column to the UMAP model during fitting and it will make use of it to perform a supervised dimension reduction (S-UMAP).

The umap() function in the uwot package includes several others parameters to build a low-dimensional graph. The four main parameters are n_neighbors which determines the number of neighbors used in constructing the nearest neighbor graph, min_dist which determines the allowed separation between connected embeddings, n_components which is the dimensionality of the embedding space, and metric which defines the distance metric (e.g., Euclidean, cosine) used to define the distances between points in the high-dimensional data space. We used the default settings for each, i.e., 100 neighbors, a minimal distance of 0.01, two dimensions and the Euclidean distance metric.

UMAP is a stochastic algorithm. This means that different runs of UMAP may have variations. The function performs the dimensional reduction n times (with n = 100 by default) with different random seeds to compute a distribution of silhouette scores for the various observations. For each repetition, in order to estimate the quality of the partitioning provided by UMAP and thus quantify the degree of clustering of call types and individual signatures, we compute the values of the silhouette coefficients (Rousseeuw, 1987). We use to this end function get_Silhouette in the clues package.

The function returns both a list of SUMAP-reduced coordinates and a list of silhouette scores for the calls (for the n repetitions).

get_umap_and_silhouette <- function(df_umap, target_DV, target_algo, target_set, n_neighbors = 100, metric = "euclidean", min_dist = 0.01, n = 100) {

  df_umap <- df_umap %>%
    filter(algo == target_algo, set == target_set) %>%
    dplyr::select(-algo, -set)

  if (target_DV == "individual")
    other_V <- "type"
  
  if (target_DV == "type")
    other_V <- "individual"

  m <- df_umap %>%
    dplyr::select(-id, -individual, -type, -sequence)
  
  umap_list <- vector(mode='list', length = n)
  sil_list <- vector(mode='list', length = n)

  for (i in 1:n) {
    set.seed(i)

    umap_coords <- m %>%
      umap(n_neighbors = n_neighbors, n_components = 2, metric = "euclidean", min_dist = 0.01, 
           n_threads = detectCores() - 2, y = df_umap[[target_DV]]) %>%
      data.frame()
  
    umap_coords[[target_DV]] <- df_umap[[target_DV]]
    umap_coords[[other_V]] <- df_umap[[other_V]]
    umap_coords$sequence <- df_umap$sequence
    
    silhouette <- get_Silhouette(as.matrix(umap_coords[, 1:2]), umap_coords[[target_DV]], disMethod = "Euclidean")
    silhouette.df <- data.frame(sil = silhouette$s,
                                observed = umap_coords[[target_DV]],
                                id = df_umap %>% pull(id))

    umap_list[[i]] <- umap_coords %>% mutate(index = i)
    sil_list[[i]] <- silhouette.df %>% mutate(index = i)
  }

  return (list(umap_list = umap_list, sil_list = sil_list))
}

We also define the function get_sil_plot() which takes as input a data table which contains silhouette scores for all our calls and for a number of repetitions (referred to with the variable index in the data table) (see above) We first compute the average silhouette scores of the calls overt the repetitions, and order the table in order to visualize things properly. We also first compute a grand average for each level of the target variable in the initial SUMAP, then the standard deviation of the average score of each level over the repetitions. The values are then displayed in the figure.

get_sil_plot <- function(dis.silhouette.df) {
  
  silhouette.df <- dis.silhouette.df %>%
    group_by(observed, id) %>%
    summarize(sil = mean(sil)) %>%
    ungroup() %>%
    arrange(observed, -sil) %>%
    mutate(name = as.factor(1:nrow(.)))  

  means.per.group <- silhouette.df %>%
    mutate(name = as.integer(name)) %>%
    group_by(observed) %>%
    summarize(mean.group = round(mean(sil), 2), mean.name = mean(name), min = name[which.min(name)], max = name[which.max(name)])
  
  sd.per.group <- dis.silhouette.df %>%
    group_by(index, observed) %>%
    summarize(sil = mean(sil)) %>%
    ungroup() %>%
    group_by(observed) %>%
    summarize(sd.group = round(sd(sil), 3))

  means.per.group <- means.per.group %>% inner_join(sd.per.group)
  
  p <- silhouette.df %>%
    ggplot(., aes(x = name, y = sil, color = observed, fill = observed)) +
    geom_col() +
    labs(y = "Silhouette value",
       x = "",
       fill="Cluster",
       color="Cluster",
       title = paste0("Silhouette plot - ", "Raw features", "\n", "Bioacoustic, DCT, MFCC"),
       subtitle = paste0("Average silhouette value: ", round(mean(silhouette.df$sil), 2)) ) +
    scale_color_brewer(palette = "Paired") +
    scale_fill_brewer(palette = "Paired") +
    ggplot2::ylim(c(-1.1, 1.1)) +
    geom_text(data = means.per.group, aes(x = mean.name, y = mean.group, label = mean.group), color="black", hjust = -0.2, vjust = -0.3) +
    geom_segment(data = means.per.group, aes(x = min - 0.5, xend = max + 0.5, y = mean.group, yend = mean.group), size = 0.75, linetype = "dashed", color="black") +
    geom_segment(data = means.per.group, aes(x = (min + max) * 0.5, xend = (min + max) * 0.5, y = mean.group - 1.96 * sd.group, yend = mean.group + 1.96 * sd.group), size = 0.5, linetype = "dashed", color="black") +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
    coord_flip() +
    guides(fill="none", color="none")
  
  return (p)
}

The function get_umap_plot() takes the SUMAP-reduced coordinates of the calls and the target variable for the supervised dimensionality reduction, and returns a scatter plot.

get_umap_plot <- function(umap_df, target_DV, title, size = 1, alpha = 0.5) {
   p_umap <- umap_df %>% ggplot(aes(x = X1, y = X2, col = .data[[target_DV]])) +
    geom_point(size = size, alpha = alpha) +
    scale_color_brewer(palette = "Paired") +
    guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1.0))) +
    ggtitle(title)
   
   return (p_umap)
}

The function get_umap_plot() takes the SUMAP-reduced coordinates of the calls and the target variable for the supervised dimensionality reduction, and returns one scatter plot per level of the variable complementary to the target variable, i.e., type for individual, and individual for type.

get_umap_plot_decomposed <- function(umap_df, target_DV, title, size = 1, alpha = 0.5) {
  
  if (target_DV == "individual")
    other_V <- "type"
  
  if (target_DV == "type")
    other_V <- "individual"
  
  p_umap <- umap_df %>% ggplot(aes(x = X1, y = X2, col = .data[[target_DV]])) +
    geom_point(size = size, alpha = alpha) +
    scale_color_brewer(palette = "Paired") +
    guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1.0))) +
    ggtitle(title) +
    facet_wrap(vars(.data[[other_V]]))
  
  return (p_umap)
}

4 S-UMAP & silhouettes for type

We call the function get_umap_and_silhouette() to collect results for type with all the predictors contained in our three main sets.

target_columns <- c("id", "individual", "type", "sequence", Bioacoustic_set, DCT_set, MFCC_set) %>% unique()
df_umap <- df %>%
  select(one_of(target_columns)) %>%
  mutate(algo = "Raw features", set = "Bioacoustic, DCT, MFCC")

results_type <- df_umap %>% get_umap_and_silhouette("type", "Raw features", "Bioacoustic, DCT, MFCC", n = 100)

We concatenate the elements of the list of tables silhouette scores into a single table.

distrib_silhouette_df_type <- do.call(rbind, results_type$sil_list)

4.1 S-UMAP plot

umap_type <- results_type$umap_list[[1]] %>% get_umap_plot("type", "UMAP - Raw features\nBioacoustic, DCT, MFCC")
umap_type

4.2 S-UMAP interactive plot

results_type$umap_list[[1]] %>% plot_ly(x = ~X1, y = ~X2, color = ~type, colors = "Paired", type = 'scatter', mode = 'markers',
              hoverinfo = "text",
              hovertext = paste("Type:", results_type$umap_list[[1]]$type,
                                "<br>Individual:", results_type$umap_list[[1]]$individual,
                                "<br>Sequence:",  results_type$umap_list[[1]]$sequence)) %>%  
  layout(
    plot_bgcolor = "#ffffff",
    legend=list(title=list(text='Type')), 
    xaxis = list( 
      title = "0"),  
    yaxis = list( 
      title = "1")) 

4.3 Treillis plot showing S-UMAP for each individual

umap_decomposed_type <- results_type$umap_list[[1]] %>% get_umap_plot_decomposed("type", "UMAP - Raw features\nBioacoustic, DCT, MFCC")
umap_decomposed_type

4.4 Silhouette plot

sil_type <- distrib_silhouette_df_type %>% get_sil_plot()
sil_type

5 S-UMAP & silhouettes for individuals

We call the function get_umap_and_silhouette() to collect results for individual, once again with all the predictors contained in our three main sets.

target_columns <- c("id", "individual", "type", "sequence", Bioacoustic_set, DCT_set, MFCC_set) %>% unique()
df_umap <- df %>%
  select(one_of(target_columns)) %>%
  mutate(algo = "Raw features", set = "Bioacoustic, DCT, MFCC")

results_individual <- df_umap %>% get_umap_and_silhouette("individual", "Raw features", "Bioacoustic, DCT, MFCC", n = 100)

We concatenate the elements of the list of tables silhouette scores into a single table.

distrib_silhouette_df_individual <- do.call(rbind, results_individual$sil_list)

5.1 S-UMAP plot

cols = c("#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64", "#599861", "#89C5DA", "#DA5724", "#74D944", "#CE50CA")

umap_individual <- results_individual$umap_list[[1]] %>%
  get_umap_plot("individual", "UMAP - Raw features\nBioacoustic, DCT, MFCC") +
  scale_color_manual(values = cols)
umap_individual

5.2 S-UMAP interactive plot

results_individual$umap_list[[1]] %>% plot_ly(x = ~X1, y = ~X2, color = ~individual, colors = c("#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64", "#599861", "#89C5DA", "#DA5724", "#74D944", "#CE50CA"), type = 'scatter', mode = 'markers',
              hoverinfo = "text",
              hovertext = paste("Type:", results_individual$umap_list[[1]]$type,
                                "<br>Individual:", results_individual$umap_list[[1]]$individual,
                                "<br>Sequence:",  results_individual$umap_list[[1]]$sequence)) %>%  
  layout(
    plot_bgcolor = "#ffffff",
    legend=list(title=list(text='Individual')), 
    xaxis = list( 
      title = "0"),  
    yaxis = list( 
      title = "1")) 

5.2.1 Treillis plot showing S-UMAP for each call type

umap_decomposed_individual <- results_individual$umap_list[[1]] %>%
  get_umap_plot_decomposed("individual", "UMAP - Raw features\nBioacoustic, DCT, MFCC") +
  scale_color_manual(values = cols)
umap_decomposed_individual

5.3 Silhouette plot

sil_individual <- distrib_silhouette_df_individual %>%
  get_sil_plot() +
  scale_color_manual(values = cols)
sil_individual

6 Final layout

grid.arrange(
  umap_type, umap_individual, 
  sil_type, sil_individual,
  nrow = 2, ncol = 2,
  heights = c(0.5, 1)
)

7 Environment

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
OS version: Windows 10 x64 (build 22621)

8 Packages

parallel: Support for Parallel computation in R version 4.1.3
stats: The R Stats Package version 4.1.3
graphics: The R Graphics Package version 4.1.3
grDevices: The R Graphics Devices and Support for Colours and Fonts version 4.1.3
utils: The R Utils Package version 4.1.3
datasets: The R Datasets Package version 4.1.3
methods: Formal Methods and Classes version 4.1.3
base: The R Base Package version 4.1.3
splitstackshape: Stack and Reshape Datasets After Splitting Concatenated Values version 1.4.8
plotly: Create Interactive Web Graphics via ‘plotly.js’ version 4.10.1
clues: Clustering Method Based on Local version 0.6.2.2
uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction version 0.1.14
Matrix: Sparse and Dense Matrix Classes and Methods version 1.4-1
gridExtra: Miscellaneous Functions for “Grid” Graphics version 2.3
forcats: Tools for Working with Categorical Variables (Factors) version 0.5.2
stringr: Simple, Consistent Wrappers for Common String Operations version 1.4.1
purrr: Functional Programming Tools version 0.3.4
readr: Read Rectangular Text Data version 2.1.3
tidyr: Tidy Messy Data version 1.2.1
tibble: Simple Data Frames version 3.1.8
ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics version 3.4.0
tidyverse: Easily Install and Load the ‘Tidyverse’ version 1.3.2
dplyr: A Grammar of Data Manipulation version 1.0.10
knitr: A General-Purpose Package for Dynamic Report Generation in R version 1.41